Now that we have cleaned the indicators of the problems described above, we will focus on selecting a good model for text similarity.
When working with NLP and text similarity there are several ways to approach the problem, one of the most common being vectorizers. These algorithms count the words present in a sentence and check whether they appear in the sentence we are comparing against, so the more words two texts share, the more similar they are considered. The best-known method is the TF-IDF vectorizer; it improves on the basic count vectorizer by weighting each word by its frequency across the whole document collection, resulting in a much less biased model.
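As a minimal sketch of the vectorizer approach just described (the example sentences are illustrative, not taken from our indicator set), TF-IDF plus cosine similarity looks like this:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

docs = [
    "Access to public transport",
    "Access to public transportation",
    "Urban green space per capita",
]

# TF-IDF downweights words that appear in many documents,
# so shared rare words contribute more to the similarity score.
tfidf = TfidfVectorizer()
matrix = tfidf.fit_transform(docs)

# Pairwise cosine similarity between the vectorized documents
sims = cosine_similarity(matrix)
print(sims.round(2))
```

Note the weakness we care about: "transport" and "transportation" are different tokens here, so the vectorizer only matches them through the other shared words, and true synonyms would score zero.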
Although these algorithms are adequate for simple tasks, in this case we have a large amount of text from formal documents, full of synonyms and very specific words that are rarely repeated. We will therefore not use them; instead we will apply pre-trained transformers (sentence embeddings) from HuggingFace. Specifically, we will try the following three models (selected from the performance ranking that can be found here):
all-mpnet-base-v2 - Best performance overall, not very heavy.
gtr-t5-xxl - Huge model trained on 2B+ question-answer pairs, slightly better overall performance than sentence-t5-xxl.
all-roberta-large-v1 - Third best performing sentence embedding, not as heavy as gtr-t5-xxl.
Unfortunately, the gtr-t5-xxl model is too large to fit in our RAM, so we will replace it with the (apparently deprecated) stsb-roberta-large.
Since we have no sentence pairs to train or evaluate the models on, we will encode the indicators and form two clusters, interpreting one as general and the other as cultural, to get an idea of which model performs best. This is not an indicator analysis, but rather a check of whether these models can handle our data.
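The sanity check just described can be sketched as follows. The random `embeddings` and `labels` here are placeholders standing in for the real encoder output and the general/cultural tags:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import confusion_matrix, f1_score

# Placeholders: in the real run, embeddings come from one of the candidate
# encoders and labels mark each indicator as general (0) or cultural (1)
rng = np.random.default_rng(0)
embeddings = rng.normal(size=(100, 768))
labels = rng.integers(0, 2, size=100)

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
pred = kmeans.fit_predict(embeddings)

# Cluster ids are arbitrary: evaluate both label assignments and keep the best
acc = max((pred == labels).mean(), ((1 - pred) == labels).mean())
f1 = max(f1_score(labels, pred), f1_score(labels, 1 - pred))
print(confusion_matrix(labels, pred))
print("accuracy:", acc, "f1:", f1)
```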
Surprisingly, MPNET turns out to be far from what we expected. That does not mean it is a worse model in general, but for our objective it falls short of the other two in every respect. Roberta-All has slightly higher accuracy than Roberta-stsb, but the latter has a considerably higher F1-score and a better-distributed confusion matrix. With this in mind, we will use the Roberta-stsb model for this project.
Now that we have chosen a model, we will explore what it has to offer, starting with a 3D visualization of our indicators compressed through a three-component PCA, to visually check whether there is a difference between cultural (red) and general (blue) indicators.
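A minimal sketch of this projection and plot, again with random placeholders for the real encodings and labels:

```python
import matplotlib
matplotlib.use("Agg")  # headless backend, so the script runs without a display
import matplotlib.pyplot as plt
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(1)
embeddings = rng.normal(size=(100, 768))         # placeholder encodings
is_cultural = rng.integers(0, 2, size=100) == 1  # placeholder labels

# Compress the embeddings down to three principal components
pca = PCA(n_components=3)
coords = pca.fit_transform(embeddings)

fig = plt.figure()
ax = fig.add_subplot(projection="3d")
ax.scatter(*coords[~is_cultural].T, c="blue", label="general")
ax.scatter(*coords[is_cultural].T, c="red", label="cultural")
ax.legend()
plt.savefig("indicators_pca3d.png")
```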
We see there's indeed a reasonable difference between the two types of indicators, even though it is not as strong as it could be.
To understand whether the model is good enough for our purposes, we will cluster the three components into different groups, giving us smaller and more consistent chunks of indicators. For this we will use the KMeans algorithm.
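One common way to choose the number of clusters is the elbow method over KMeans inertia; a sketch on placeholder 3D coordinates:

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(2)
coords = rng.normal(size=(100, 3))  # placeholder for the 3 PCA components

# Inertia (within-cluster sum of squares) for each candidate k;
# the "elbow" where the curve flattens suggests a good cluster count
inertias = {
    k: KMeans(n_clusters=k, n_init=10, random_state=0).fit(coords).inertia_
    for k in range(2, 9)
}
for k, inertia in inertias.items():
    print(k, round(inertia, 1))
```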
As we can see, the best number of clusters for the most efficient indicator representation is 3. In the plot below you can click on the legend (the general/cultural labels on the right side) to hide the type you want to ignore, so you can focus on the distribution of the other.
The differences between the clusters are not huge, but they do exist. The plot above shows that the distribution is far from random: the second cluster (the pink one) is the one that contains the cultural indicators, with a p-value of 2.10e-32, firmly rejecting the null hypothesis.
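A test of this kind can be run with a chi-squared test on the cluster-by-type contingency table. The counts below are invented for illustration (only the totals, 822 general and 274 cultural indicators, match the numbers discussed in this section):

```python
import numpy as np
from scipy.stats import chi2_contingency

# Hypothetical counts: rows = clusters, columns = (general, cultural)
contingency = np.array([
    [300,  20],
    [250, 200],   # a cluster enriched in cultural indicators
    [272,  54],
])

# Null hypothesis: indicator type is independent of cluster assignment
chi2, p_value, dof, expected = chi2_contingency(contingency)
print(f"p-value: {p_value:.2e}")
```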
This is due to the low number of cultural indicators compared to general ones. With just 3 clusters we would expect roughly 1096 / 3 ≈ 365 indicators per group, but we only have 274 cultural indicators, and some of them are not that exclusive, so we can expect some "false negatives". Overall, the ideas behind these representations can be summarized as:
Overall it looks quite good, describing all the data with only three dimensions. We will now reduce the original dimensionality of the base encoder, but not as aggressively as before, so that it can be even more precise for this task.
To "fine-tune" the model (not fine-tuning in the strict sense, since we are not optimizing accuracy or any other metric; this is unsupervised learning), we will reduce the dimensionality of the encoder to a reasonable number of components, larger than 3, preventing both overfitting and underfitting.
This graph shows the explained variance as a function of the number of PCA components. It is clear that most of the variance is explained by fewer than 100 components (84%), and we can reduce this number even further without losing much information, since 50 components are enough to explain 70% of the variance. We will take a closer look to find the optimal number of components.
The most important breakpoints we have when looking at the graph are the following:
44% Explained Variance: 16 columns
58% Explained Variance: 29 columns
72% Explained Variance: 52 columns
Looking at these breakpoints, the best option seems to be to retain 52 columns, capturing 72% of the explained variance while preventing overfitting.
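Building the reduced encoding matrix is then a single PCA fit; a sketch with a random placeholder standing in for the encoder output (with random data the captured variance will not match the 72% figure, which comes from the real embeddings):

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(3)
embeddings = rng.normal(size=(1096, 768))  # placeholder for the encoder output

# Project all 1096 indicator embeddings onto the top 52 components
pca = PCA(n_components=52)
reduced = pca.fit_transform(embeddings)

print(reduced.shape)
print("variance captured:", pca.explained_variance_ratio_.sum())
```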
Now that we have created the encoding matrix, we can look at the indicators that are most similar to each other (computed with cosine similarity) and at those that do not really match any other, to validate that our model is working properly.
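Finding each indicator's best match amounts to a pairwise cosine-similarity matrix with the diagonal masked out; a sketch on placeholder reduced encodings:

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

rng = np.random.default_rng(4)
reduced = rng.normal(size=(50, 52))  # placeholder reduced encodings

sims = cosine_similarity(reduced)
np.fill_diagonal(sims, -1.0)         # ignore trivial self-matches

best_match = sims.argmax(axis=1)     # index of the most similar other indicator
best_score = sims.max(axis=1)        # similarity of that best match
print(best_score[:5].round(3))
```

A histogram of `best_score` is what the text above refers to: the share of indicators whose best match clears the 0.75 threshold.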
As the histogram above shows, most of the indicators (over 50%) are fairly good matches, taking 0.75+ as a good match. We will now check the best and worst matches between indicators to see how the model performs in extreme cases.
We can see in the top 15 matches that most of them are lexical variations (Access to public transport - Access to public transportation), minimal changes in long sentences (heatwave - urban heat island) and, most importantly, synonym or context recognition (Water use - Water consumption).
In conclusion, for the best matches our model works pretty well, even better than expected in some cases (Consumption - Use).
Looking at the worst matches we can see some very good matches (Anthropogenic marks and footprints of human influence - Ethnic heterogeneity) and some bad ones (Site micro-climate - personality and neighbourhood), as well as some cleaning errors from translation ("pubic spaces", which the model still matches reasonably well). It is not clear whether we should keep these indicators, since they might interfere with the final result (as overfitting); to decide properly we will check 15 more indicators.
Again we face the same problem, but these seem better than the previous 15, as it is not completely clear that they are bad matches (excluding Porosity and Critical distances), though some do not find a proper match (Critical distances in the network). Even if some are not as good as they should be, the model works well enough that we will not remove indicators below a certain threshold.
This random sample shows that our model works very well regardless of the similarity value; even among the "bad matches" there were accurate relations between indicators that were not obvious (Activity rate - Growth rate). We will leave it as it is for now.
One of our objectives was to match general indicators with cultural ones, and that is what we will do here. Below is a sample of these matches:
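Cross-matching the two groups uses a rectangular similarity matrix instead of a square one; a sketch with placeholder encodings for the two groups:

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

rng = np.random.default_rng(5)
general = rng.normal(size=(20, 52))   # placeholder general-indicator encodings
cultural = rng.normal(size=(8, 52))   # placeholder cultural-indicator encodings

# Rectangular similarity matrix: rows = cultural, columns = general
sims = cosine_similarity(cultural, general)

# Indices of the top-5 general matches for each cultural indicator,
# sorted from most to least similar
top5 = np.argsort(-sims, axis=1)[:, :5]
print(top5.shape)
```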
In the 3D visualization below you can see the top 5 matches for the selected indicator and their positions in the three-component PCA. Note that this filter will not be available in the HTML file, as the interactive widget cannot be embedded; it is only available in the notebook version.